%matplotlib inline
from utils import *
Clustering using the CyberActivd graph embedding was mostly successful.
For each graph type - YouTube, GMail, VGame, Attack, Download, and CNN - the top-ranked members of each cluster consist of one uniform type of subgraph.
The input is a tab-separated file with one edge per line, in the following format:
source-id source-type destination-id destination-type edge-type graph-id
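As a minimal sketch, one such line can be split into its six fields (the sample line below is illustrative, not from the dataset):

```python
# Parse one tab-separated edge line into the six fields described above.
line = "4\tb\t77\tc\tu\t0"
source_id, source_type, destination_id, destination_type, edge_type, graph_id = line.split("\t")
print(source_type, edge_type, destination_type)  # b u c
```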
Graph IDs correspond to scenarios as follows:
As we saw in the analysis, there are different types of graphs. To identify similarity in dynamic graphs, it is important to account for both structure and time:
embeddings of two nodes whose neighborhoods have a large overlap should be similar, and closeness in time should also pull the embeddings of two nodes closer together.
We are developing an approach to create graph embeddings for dynamic graphs. The building blocks are subgraphs built from the nodes and edges that fall within a fairly small, predefined time slice.
First, we convert the CSV data into a sequence of node_edge_node triples.
For example, the following table
| source_id | source_type | destination_id | destination_type | edge_type | graph_id |
|---|---|---|---|---|---|
| 4 | b | 77 | c | u | 0 |
| 4 | a | 77 | f | v | 0 |
| 4 | a | 0 | d | t | 0 |
is encoded as a sentence of three words:
"b_u_c a_v_f a_t_d"
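The encoding above can be sketched in a few lines: each row contributes one `source_type_edge_type_destination_type` "word", and a graph's rows join into a sentence (the rows below reproduce the example table):

```python
# Rows in the order (source_id, source_type, destination_id, destination_type,
# edge_type, graph_id), as in the table above.
rows = [
    ("4", "b", "77", "c", "u", "0"),
    ("4", "a", "77", "f", "v", "0"),
    ("4", "a", "0",  "d", "t", "0"),
]

# One "word" per edge: source_type _ edge_type _ destination_type.
sentence = " ".join(f"{s_type}_{e_type}_{d_type}"
                    for _sid, s_type, _did, d_type, e_type, _gid in rows)
print(sentence)  # b_u_c a_v_f a_t_d
```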
The authors mapped node and edge names to single letters according to the map below. Using that map, the three words translate as:
b_u_c = thread open file
a_v_f = process read stdin
a_t_d = process mmap2 file
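The translation step can be sketched with small lookup dictionaries. Note: the maps below are reconstructed only from the three examples above, not the authors' full `node_edge_map` (which the next cell prints):

```python
# Letter-to-name maps inferred from the three translated examples above;
# the real node_edge_map in utils.py is larger.
node_map = {"b": "thread", "a": "process", "c": "file", "f": "stdin", "d": "file"}
edge_map = {"u": "open", "v": "read", "t": "mmap2"}

def decode(word):
    """Translate a word like 'b_u_c' back into 'thread open file'."""
    s, e, d = word.split("_")
    return f"{node_map[s]} {edge_map[e]} {node_map[d]}"

print(decode("b_u_c"))  # thread open file
```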
print_dict(node_edge_map, 6)
We can think of the triples as words, and sequences of such triples as sentences.
We use the gensim package, specifically its Doc2Vec functionality, to convert each word, and ultimately the whole sentence, into an embedding of a predefined size.
Here are the processing steps implemented in the function process_data in utils.py:
For each graph type (YouTube, GMail, VGame, Attack, Download, and CNN):
- select the records for one graph_id
- convert each sentence into a vector of predefined size using the gensim package
- cluster the vectors with KMeans from scikit-learn
- evaluate the clustering with the silhouette score

The silhouette score measures the quality of clustering. To quote this Wikipedia article: "The silhouette value is a measure of how similar an object is to its own cluster compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters."
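The clustering and scoring steps can be sketched on synthetic embeddings (in the notebook the inputs are the Doc2Vec vectors; the data below is random and illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Three well-separated synthetic "embedding" clusters, 20 points each, 8-dim.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.1, size=(20, 8)) for loc in (-1.0, 0.0, 1.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
overall = silhouette_score(X, km.labels_)       # one score for the whole clustering
per_sample = silhouette_samples(X, km.labels_)  # one score per subgraph
print(round(overall, 2), per_sample.shape)
```

`silhouette_samples` gives the per-subgraph values used later to rank members within each cluster, while `silhouette_score` is just their mean.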
n_components = 3
number_graphs = 24
display_config = (8,3)
display(Markdown('# For one graph from each graph type: YouTube, GMail, VGame, Attack, Download, and CNN'))
display(Markdown('### ' + 'Cluster the subgraphs into ' + str(n_components) + ' clusters' ))
display(Markdown('### ' + 'For each cluster display the top ' + str(number_graphs) + ' subgraphs' ))
%%time
graph_id = 51
graph_type = 'YouTube'
test_data, sample_silhouette_values, km, all_graphs, node_colors = process_data(graph_id, n_components, epochs = 100)
for i_comp in range(n_components):
display(Markdown('### ' + graph_type + ' Cluster ' + str(i_comp)))
ith_indices = np.where(km.labels_ == i_comp)[0]
ith_cluster_silhouette_values = sample_silhouette_values[ith_indices]
sorted_index_array = np.argsort(ith_cluster_silhouette_values)[::-1]
sub_graphs = [all_graphs[i] for i in ith_indices]
sub_colors = [node_colors[i] for i in ith_indices]
visualize_graphs(sub_graphs, sub_colors, sorted_index_array[0:number_graphs], display_config)
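The per-cluster selection in these cells can be sketched in isolation: take the members of one cluster and rank them by silhouette value, descending (the labels and values below are toy data):

```python
import numpy as np

# Toy cluster labels and per-sample silhouette values.
labels = np.array([0, 1, 0, 2, 0, 1])
sil = np.array([0.2, 0.9, 0.8, 0.1, 0.5, 0.3])

i_comp = 0
ith_indices = np.where(labels == i_comp)[0]   # members of cluster 0: [0, 2, 4]
order = np.argsort(sil[ith_indices])[::-1]    # positions sorted best-first
top = ith_indices[order][:2]                  # two best-matched members
print(top)  # [2 4]
```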
%%time
graph_id = 180
graph_type = 'GMail'
test_data, sample_silhouette_values, km, all_graphs, node_colors = process_data(graph_id, n_components, epochs = 100)
for i_comp in range(n_components):
display(Markdown('### ' + graph_type + ' Cluster ' + str(i_comp)))
ith_indices = np.where(km.labels_ == i_comp)[0]
ith_cluster_silhouette_values = sample_silhouette_values[ith_indices]
sorted_index_array = np.argsort(ith_cluster_silhouette_values)[::-1]
sub_graphs = [all_graphs[i] for i in ith_indices]
sub_colors = [node_colors[i] for i in ith_indices]
visualize_graphs(sub_graphs, sub_colors, sorted_index_array[0:number_graphs], display_config)
%%time
graph_id = 251
graph_type = 'VGame'
test_data, sample_silhouette_values, km, all_graphs, node_colors = process_data(graph_id, n_components, epochs = 100)
for i_comp in range(n_components):
display(Markdown('### ' + graph_type + ' Cluster ' + str(i_comp)))
ith_indices = np.where(km.labels_ == i_comp)[0]
ith_cluster_silhouette_values = sample_silhouette_values[ith_indices]
sorted_index_array = np.argsort(ith_cluster_silhouette_values)[::-1]
sub_graphs = [all_graphs[i] for i in ith_indices]
sub_colors = [node_colors[i] for i in ith_indices]
visualize_graphs(sub_graphs, sub_colors, sorted_index_array[0:number_graphs], display_config)
%%time
graph_id = 301
graph_type = 'Attack'
test_data, sample_silhouette_values, km, all_graphs, node_colors = process_data(graph_id, n_components, epochs = 10)
for i_comp in range(n_components):
display(Markdown('### ' + graph_type + ' Cluster ' + str(i_comp)))
ith_indices = np.where(km.labels_ == i_comp)[0]
ith_cluster_silhouette_values = sample_silhouette_values[ith_indices]
sorted_index_array = np.argsort(ith_cluster_silhouette_values)[::-1]
sub_graphs = [all_graphs[i] for i in ith_indices]
sub_colors = [node_colors[i] for i in ith_indices]
visualize_graphs(sub_graphs, sub_colors, sorted_index_array[0:number_graphs], display_config)
%%time
graph_id = 450
graph_type = 'Download'
test_data, sample_silhouette_values, km, all_graphs, node_colors = process_data(graph_id, n_components, epochs = 100)
for i_comp in range(n_components):
display(Markdown('### ' + graph_type + ' Cluster ' + str(i_comp)))
ith_indices = np.where(km.labels_ == i_comp)[0]
ith_cluster_silhouette_values = sample_silhouette_values[ith_indices]
sorted_index_array = np.argsort(ith_cluster_silhouette_values)[::-1]
sub_graphs = [all_graphs[i] for i in ith_indices]
sub_colors = [node_colors[i] for i in ith_indices]
visualize_graphs(sub_graphs, sub_colors, sorted_index_array[0:number_graphs], display_config)
%%time
graph_id = 551
graph_type = 'CNN'
test_data, sample_silhouette_values, km, all_graphs, node_colors = process_data(graph_id, n_components, epochs = 100)
for i_comp in range(n_components):
display(Markdown('### ' + graph_type + ' Cluster ' + str(i_comp)))
ith_indices = np.where(km.labels_ == i_comp)[0]
ith_cluster_silhouette_values = sample_silhouette_values[ith_indices]
sorted_index_array = np.argsort(ith_cluster_silhouette_values)[::-1]
sub_graphs = [all_graphs[i] for i in ith_indices]
sub_colors = [node_colors[i] for i in ith_indices]
visualize_graphs(sub_graphs, sub_colors, sorted_index_array[0:number_graphs], display_config)